Abstract

This paper proposes a “quasi-synchronous” design approach for signal processing circuits, in which 
timing violations are permitted, but without the need for a hardware compensation mechanism. The case 
of a low-density parity-check (LDPC) decoder is studied, and a method for accurately modeling the 
effect of timing violations at a high level of abstraction is presented. The error-correction performance 
of code ensembles is then evaluated using density evolution while taking into account the effect of 
timing faults. Following this, several quasi-synchronous LDPC decoder circuits based on the offset 
min-sum algorithm are optimized, providing a 23% to 40% reduction in energy consumption or
energy-delay product, while achieving the same performance and occupying the same area as conventional
synchronous circuits. 


I. Introduction 

The time required for a signal to propagate through a CMOS circuit varies depending on several 
factors. Some of the variation results from physical limitations: the delay depends on the initial 
and final charge state of the circuit. Other variations are due to the difficulty (or impossibility) 
of controlling the fabrication process and the operating conditions of the circuit [1]. As process 
technologies approach atomic scales, the magnitude of these variations is increasing, and reducing 
the supply voltage to save energy increases the variations even further [2]. 
The variation in propagation delay is a source of energy inefficiency for synchronous circuits
since the clock period is determined by the worst delay. One approach to alleviate this problem is 
to allow timing violations to occur. While this would normally be catastrophic, some applications 
(in signal processing or in error-correcting decoding, for example) can tolerate a degradation in 
the operation of the circuit, either because an approximation to the ideal output suffices, or 
because the algorithm intrinsically rejects noise. This paper proposes an approach to the design 
of systems that are tolerant to timing violations. In particular we apply this approach to the design 
of energy-optimized low-density parity-check (LDPC) decoder circuits based on a state-of-the-art 
soft-input algorithm and architecture. 

Other approaches have been previously proposed to build synchronous systems that can tolerate 
some timing violations. In better than worst-case (BTWC) [3] or voltage over-scaled (VOS) 
circuits, a mechanism is added to the circuit to compensate or recover from timing faults. One 
such method introduces special latches that can detect timing violations, and can trigger a restart 
of the computation when needed [4], [5]. Since the circuit’s latency is increased significantly 
when a timing violation occurs, this approach is only suitable for tolerating small fault rates (e.g., 
10“^) and for applications where the circuit can be easily restarted, such as microprocessors that 
support speculative execution. 

In most signal processing tasks, it is acceptable for the output to be non-deterministic, which 
creates more possibilities for dealing with timing violations. A seminal contribution in this area 
was the algorithmic noise tolerance (ANT) approach [6], [7], which is to allow timing violations 
to occur in the main processing block, while adding a separate reliable processing block with 
reduced precision that is used to bound the error of the main block, and provide algorithmic 
performance guarantees. The downside of the ANT approach is that it relies on the assumption 
that timing violations will first occur in the most significant bits. If that is not the case, the 
precision of the circuit can degrade to the precision of the auxiliary block, limiting the scheme’s 
usefulness. For many circuits, including some adder circuits [8], this assumption does not hold. 
Furthermore, the addition of the reduced precision block and of a comparison circuit increases 
the area requirement. 

We propose a design methodology for digital circuits with a relaxed synchronicity requirement 
that does not rely on any hardware compensation mechanism. Instead, we provide performance 
guarantees by re-analyzing the algorithm while taking into account the effect of timing violations. 
We say that such systems are quasi-synchronous. LDPC decoding algorithms are good candidates
for a quasi-synchronous implementation because their throughput and energy consumption are
limiting factors in many applications, and like other signal processing algorithms, their performance
is assessed in terms of expected values. Furthermore, since the algorithm is iterative, there
is a possibility to optimize each iteration separately, and we show that this allows for additional 
energy savings. 

The topic of unreliable LDPC decoders has been discussed in a number of contributions. 
Varshney studied the Gallager-A and the Sum-Product decoding algorithms when the computations
and the message exchanges are “noisy”, and showed that the density evolution analysis
still applies [9]. The Gallager-B algorithm was also analyzed under various scenarios [10]-[12]. 
A model for an unreliable quantized Min-Sum decoder was proposed in [13], which provided 
numerical evaluation of the density evolution equations as well as simulations of a finite-length 
decoder. Faulty finite-alphabet decoders were studied in [14], where it was proposed to model 
the decoder messages using conditional distributions that depend on the ideal messages. The 
quantized Min-Sum decoder was also analyzed in [15] for the case where faults are the result of 
storing decoder messages in an unreliable memory. The specific case of faults caused by delay 
variations in synchronous circuits is considered in [16], where a deviation model is proposed for 
binary-output circuits in which a deviation occurs probabilistically when the output of a circuit 
changes from one clock cycle to the next, but cannot occur if the output does not change. While 
none of these contributions explicitly consider the relationship between the reliability of the 
decoder’s implementation and the energy it consumes, there have been some recent developments 
in the analysis of the energy consumption of reliable decoders. Lower bounds for the scaling 
of the energy consumption of error-correction decoders in terms of the code length are derived 
in [17], and tighter lower bounds that apply to LDPC decoders are derived in [18]. The power 
required by regular LDPC decoders is also examined in [19], as part of the study of the total 
power required for transmitting and decoding the codewords. 

In this paper, we present a modeling approach that provides an accurate representation of 
the deviations introduced in the output of an LDPC decoder processing circuit in the presence 
of occasional timing violations, while simultaneously measuring its energy consumption. We 
show that this model can be used as part of a density evolution analysis to evaluate the channel 
threshold and iterative performance of the decoder when affected by timing faults. Finally, we 
show that under mild assumptions, the problem of minimizing the energy consumption of a
quasi-synchronous decoder can be simplified to the energy minimization of a small test circuit, 
and present an approximate optimization method similar to Gear-Shift Decoding [20] that finds 
sequences of quasi-synchronous decoders that minimize decoding energy subject to performance 
constraints. 

The remainder of the paper is organized as follows. Section II reviews LDPC codes and 
describes the circuit architecture of the decoder that is used to measure timing faults. Section III
presents the deviation model that represents the effect of timing faults on the algorithm.
Section IV then discusses the use of density evolution and of the deviation model to predict 
the performance of a decoder affected by timing faults. Finally, Section V presents the energy 
optimization strategy and results, and Section VI concludes the paper. Additional details on the 
CAD framework used for circuit measurements can be found in Appendix A, and Appendix B 
provides some details concerning the simulation of the test circuits. 

II. LDPC Decoding Algorithm and Architecture 
A. Code and Channel 

We consider a communication scenario where a sequence of information bits is encoded using
a binary LDPC code of length $n$. The LDPC code described by an $m \times n$ binary parity-check
matrix $H = [h_{j,i}]$ consists of all length-$n$ row vectors $v$ satisfying the equation $vH^T = 0$.
Equivalently, the code can be described by a bipartite Tanner graph with $n$ variable nodes (VN)
and $m$ check nodes (CN) having an edge between the $i$-th variable node and the $j$-th check
node if and only if $h_{j,i} \neq 0$. We assume that the LDPC code is regular, which means that in the
code’s Tanner graph each variable node has a fixed degree $d_v$ and each check node has a fixed
degree $d_c$.

Let us assume that the transmission takes place over the binary-input additive white Gaussian
noise (BIAWGN) channel. A codeword $x \in \{-1,1\}^n$ is transmitted through the channel, which
outputs the received vector $y = x + w$, where $w$ is a vector of $n$ independent and identically
distributed (i.i.d.) zero-mean normal random variables with variance $\sigma^2$. We use $x_i$ and $y_i$ to
refer to the input and output of the channel at time $i$. The BIAWGN channel has the property of
being output symmetric, meaning that $\phi_{y_i|x_i}(q \mid 1) = \phi_{y_i|x_i}(-q \mid -1)$, and memoryless, meaning
that $\phi_{y|x}(q \mid r) = \prod_{i=1}^{n} \phi_{y_i|x_i}(q_i \mid r_i)$. Throughout the paper, $\phi(\cdot)$ denotes a probability density
function. The BIAWGN channel can also be described multiplicatively as $y = xz$, where $z$ is
a vector of i.i.d. normal random variables with mean 1 and variance $\sigma^2$.
Let the belief output $\mu_i$ of the channel at time $i$ be given by

$$\mu_i = \frac{\alpha y_i}{\sigma^2} \qquad (1)$$

with $\alpha > 0$. Note that if $\alpha = 2$ then $\mu_i$ is a log-likelihood ratio. Assuming that $x_i = 1$ was
transmitted, then $\mu_i$ has a normal distribution with mean $\alpha/\sigma^2$ and variance $\alpha^2/\sigma^2$. Writing
$\mu = \alpha/\sigma^2$, we see that $\mu_i$ is Gaussian with mean $\mu$ and variance $\alpha\mu$, that is, the distribution of $\mu_i$
is described by a single parameter $\mu$. We call this distribution a one-dimensional (1-D) normal
distribution. The distribution of $\mu_i$ can also be specified using other equivalent parameters, such
as the probability of error $p_e$, given by

$$p_e = P(\mu_i < 0 \mid x_i = 1) = P(\mu_i > 0 \mid x_i = -1) = \frac{1}{2}\,\mathrm{erfc}\!\left(\frac{1}{\sqrt{2}\,\sigma}\right) = \frac{1}{2}\,\mathrm{erfc}\!\left(\sqrt{\frac{\mu}{2\alpha}}\right) \qquad (2)$$

where $\mathrm{erfc}(\cdot)$ is the complementary error function.
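As a quick numerical illustration of (1) and (2), the sketch below computes the parameters of the 1-D normal belief distribution and checks that the two expressions for $p_e$ agree. The function names and the value $\sigma = 0.8$ are chosen here for illustration only and do not come from the paper.

```python
import math

def belief_params(sigma, alpha=2.0):
    """Mean and variance of mu_i = alpha * y_i / sigma^2, given x_i = +1."""
    mu = alpha / sigma**2          # mean of the belief
    return mu, alpha * mu          # variance equals alpha * mu

def error_prob_from_sigma(sigma):
    """p_e written in terms of the channel noise standard deviation."""
    return 0.5 * math.erfc(1.0 / (math.sqrt(2.0) * sigma))

def error_prob_from_mu(mu, alpha=2.0):
    """p_e written in terms of the single parameter mu of the 1-D normal."""
    return 0.5 * math.erfc(math.sqrt(mu / (2.0 * alpha)))

sigma = 0.8                        # illustrative noise level (assumption)
mu, var = belief_params(sigma)
p_from_sigma = error_prob_from_sigma(sigma)
p_from_mu = error_prob_from_mu(mu)
print(mu, var, p_from_sigma, p_from_mu)
```

Both forms give the same value because $\mu/\alpha = 1/\sigma^2$, which is why a single parameter is enough to describe the distribution.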


B. Decoding Algorithm 

The well-known Offset Min-Sum (OMS) algorithm is a simplified version of the Sum-Product 
algorithm that can usually achieve similar error-correction performance. It has been widely 
used in implementations of LDPC decoders [21]-[23]. To make our decoder implementation 
more realistic and show the flexibility of our design framework, we present an algorithm and 
architecture that support a row-layered message-passing schedule. Architectures optimized for 
this schedule have proven effective for achieving efficient implementations of LDPC decoders 
[22]-[24]. Using a row-layered schedule also allows the decoder to be pipelined to increase the 
circuit’s utilization. In a row-layered LDPC decoder, the rows of the parity-check matrix are 
partitioned into L sets called layers. To simplify the description of the decoding algorithm, we 
assume that all the columns in a given layer contain exactly one non-zero element. This implies 
that $L = d_v$. Note that codes with at most one non-zero element per column and per layer can
also be supported by the same architecture, simply requiring a modification of the way algorithm 
variables are indexed. 

Let us define a set $\mathcal{L}_\ell$ containing the indices of the rows of $H$ that are part of layer $\ell$, $\ell \in [1, L]$.
We denote by $\mu_{i,j}^{(t)}$ a message sent from VN $i$ to CN $j$ during iteration $t$, and by $\lambda_{i,j}^{(t)}$ a message
sent from CN $j$ to VN $i$. It is also useful to refer to the CN neighbor of a VN $i$ that is part
of layer $\ell$. Because of the restriction mentioned above, there is exactly one such CN, and we
give its index a dedicated notation. Finally, we denote the channel information corresponding to the $i$-th
codeword bit by $\mu_i^{(0)}$, since it also corresponds to the first message sent by a variable node $i$ to
all its neighboring check nodes.

The Offset Min-Sum algorithm used with a row-layered message-passing schedule is described
in Algorithm 1. In the algorithm, $\mathcal{N}(j)$ denotes the set of indices corresponding to VNs that are
neighbors of a check node $j$, and $\Lambda_i$ represents the current sum of incoming messages at a VN
$i$. The function $\min_{1,2}(S)$ returns the smallest and second smallest values in the set $S$, $C > 0$
is the offset parameter, and

$$\mathrm{sgn}(x) = \begin{cases} 1 & \text{if } x \geq 0, \\ -1 & \text{if } x < 0. \end{cases}$$
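The following is a minimal sketch of one iteration of a textbook row-layered Offset Min-Sum decoder, written with the notation above ($\Lambda_i$ as `total`, $\lambda_{i,j}$ as `cn_msg`). It may differ in detail from Algorithm 1; message saturation is not modeled, and the function names and the 4-bit toy graph are assumptions made for illustration.

```python
def sgn(x):
    """Sign function with sgn(0) = +1."""
    return 1 if x >= 0 else -1

def min12(values):
    """Return the smallest and second-smallest values of an iterable."""
    m1 = m2 = float("inf")
    for v in values:
        if v < m1:
            m1, m2 = v, m1
        elif v < m2:
            m2 = v
    return m1, m2

def layered_oms_iteration(layers, neighbors, total, cn_msg, offset):
    """Process every layer once, updating belief totals in place."""
    for layer in layers:
        for j in layer:
            vns = neighbors[j]
            # VN-to-CN messages: belief total minus the previous CN-to-VN message.
            mu = {i: total[i] - cn_msg[(i, j)] for i in vns}
            m1, m2 = min12(abs(mu[i]) for i in vns)
            sign_all = 1
            for i in vns:
                sign_all *= sgn(mu[i])
            for i in vns:
                # Exclude VN i's own contribution from the minimum and the sign.
                mag = m2 if abs(mu[i]) == m1 else m1
                lam = sign_all * sgn(mu[i]) * max(mag - offset, 0)
                cn_msg[(i, j)] = lam
                total[i] = mu[i] + lam
    return total

# Toy graph (assumption): 4 VNs, 2 layers, one non-zero per column and layer.
neighbors = {0: [0, 1], 1: [2, 3], 2: [0, 2], 3: [1, 3]}
layers = [[0, 1], [2, 3]]
channel = [5, -1, 4, 3]            # integer channel beliefs, one unreliable bit
total = list(channel)
cn_msg = {(i, j): 0 for j, vns in neighbors.items() for i in vns}
result = layered_oms_iteration(layers, neighbors, total, cn_msg, offset=1)
decoded = [sgn(t) for t in result]
print(result, decoded)
```

Because each belief total is updated as soon as a layer is processed, later layers in the same iteration already benefit from the new information, which is the main appeal of the row-layered schedule.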


C. Architecture 

The Tanner graph of the code can also be used to represent the computations that must be
performed by the decoder. At each decoding iteration, one message is sent from variable to
check nodes on every edge of the graph, and again from check to variable nodes. We call a
variable node processor (VNP) a circuit block that is responsible for generating messages sent by
a variable node, and similarly a check node processor (CNP) a circuit block generating messages
sent by a check node.

In a row-layered architecture in which the column weight of layer subsets is at most 1, there
is at most one message to be sent and received for each variable node in a given layer. Therefore
VNPs are responsible for sending and receiving one message per clock cycle. CNPs on the other
hand receive and send $d_c$ messages per clock cycle. At any given time, every VNP and CNP is
mapped respectively to a VN and a CN in the Tanner graph. The routing of messages from VNPs
to CNPs and back can be posed as two equivalent problems. One can fix the mapping of VNs
to VNPs and of CNs to CNPs, and find a permutation of the message sequence that matches
VNP outputs to CNP inputs, and another permutation that matches CNP outputs to VNP inputs.
Alternatively, if VNPs process only one message at a time, one can fix the connections between
VNPs and CNPs, and choose the assignment of VNs to VNPs to achieve correct message routing.
We choose the latter approach because it allows studying the computation circuit without being
concerned with the routing of messages.

The number of CNPs instantiated in the decoder can be adjusted based on throughput requirements
from 1 to $m/L$ (the number of rows in a layer). As the number of CNPs is varied, the
number of VNPs will vary from $d_c$ to $n$. An architecture diagram showing one VNP and one
CNP is shown in Fig. 1. In reality, a CNP is connected to $d_c - 1$ additional VNPs, which are
not shown. The memories storing the belief totals $\Lambda_i$ and the intrinsic beliefs $\lambda_{i,j}^{(t)}$ are also not
shown. The part of the VNP responsible for sending a message to the CNP is called the VNP front
and the part responsible for processing a message received from a CNP is called the VNP back.
The VNP front and back do not have to be simultaneously mapped to the same VN. This makes it
easy to vary the number of pipeline stages in the VNPs and CNPs. Fig. 1 shows the circuit
with two pipeline stages.

Fig. 1. Block diagram of the layered Offset Min-Sum decoder architecture.

Messages exchanged in the decoder are fixed-point numbers. The position of the binary 
point does not have an impact on the algorithm, and therefore the messages sent by VNs in 
the first iteration can be defined as rounding the result of (1) to the nearest integer, while
choosing a suitable $\alpha$. The number of bits in the quantization, the scaling factor $\alpha$, and the OMS
offset parameter are chosen based on a density evolution analysis of the algorithm (described in 
Section IV). We quantize decoder messages to 6 bits, which yields a decoder with approximately 
the same channel threshold as a floating-point decoder under a standard fault-free implementation. 
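As a small illustration of the quantization step, the sketch below rounds the belief in (1) to the nearest integer and saturates it to a 6-bit range. The symmetric range $\{-31, \ldots, 31\}$, the tie-breaking behavior of Python's `round`, and the function name `quantize_belief` are assumptions made here, not details taken from the paper.

```python
def quantize_belief(y, sigma, alpha, bits=6):
    """Round alpha * y / sigma^2 to an integer and saturate symmetrically."""
    q_max = 2 ** (bits - 1) - 1    # assumed symmetric range, 31 for 6 bits
    value = round(alpha * y / sigma**2)
    return max(-q_max, min(q_max, value))

# Example with illustrative values of y, sigma and alpha.
print(quantize_belief(0.9, 0.8, 4.0))
print(quantize_belief(-10.0, 0.5, 4.0))
```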

In order to analyze a circuit that is representative of state-of-the-art architectures, we use an
optimized architecture for finding the first two minima in each CNP. Our architecture is inspired
by the “tree structure” approach presented in [25], but requires fewer comparators. Each pair
of CNP inputs is first sorted using the Sort block shown in Fig. 2a. These sorted pairs are
then merged recursively using a tree of Merge blocks, shown in Fig. 2b. If the number of CNP
inputs is odd, the input that cannot be paired is fed directly into a special merge block with 3
inputs, which can be obtained from the 4-input Merge block by removing the $\min_{2b}$ input and
the bottom multiplexer.

Fig. 2. Logic blocks used in the MIN$_{1,2}$ unit: (a) Sort block, (b) Merge block.
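The sort-and-merge structure can be sketched functionally as follows. This is a behavioral model only, not the gate-level circuit of Fig. 2; the function names are hypothetical, and placing the 3-input merge at the root of the tree is an assumption made for simplicity.

```python
def sort_block(a, b):
    """Sort block: order two inputs as (min1, min2)."""
    return (a, b) if a <= b else (b, a)

def merge_block(pair_a, pair_b):
    """Merge block: combine two sorted pairs into the two overall minima."""
    min1a, min2a = pair_a
    min1b, min2b = pair_b
    if min1a <= min1b:
        return min1a, min(min2a, min1b)
    return min1b, min(min1a, min2b)

def merge3_block(pair_a, c):
    """3-input merge: a sorted pair and one unpaired input."""
    min1a, min2a = pair_a
    if min1a <= c:
        return min1a, min(min2a, c)
    return c, min1a

def min12_tree(inputs):
    """Two smallest values of at least two inputs, using a tree of merges."""
    if len(inputs) < 2:
        raise ValueError("at least two inputs are required")
    pairs = [sort_block(inputs[k], inputs[k + 1])
             for k in range(0, len(inputs) - 1, 2)]
    leftover = [inputs[-1]] if len(inputs) % 2 else []
    while len(pairs) > 1:
        merged = [merge_block(pairs[k], pairs[k + 1])
                  for k in range(0, len(pairs) - 1, 2)]
        if len(pairs) % 2:
            merged.append(pairs[-1])   # odd pair passes through to the next level
        pairs = merged
    result = pairs[0]
    if leftover:
        result = merge3_block(result, leftover[0])
    return result

print(min12_tree([7, 3, 9, 3, 12]))
```

Each Merge block needs only the smaller element of the pair that loses the first comparison, which is why one of the four inputs can be dropped in the 3-input variant.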

Note that it is possible that changes to the architecture could increase or decrease the robustness 
of the decoder (see e.g. [26]), but this is outside the scope of this paper. 

III. Deviation Model 

A. Quasi-Synchronous Systems 

We consider a synchronous system that permits timing violations without hardware compensation,
resulting in what we call a quasi-synchronous system. Optimizing the energy consumption
of these systems requires an accurate model of the impact of timing violations, and of the energy 
consumption. We propose to achieve this by characterizing a test circuit that is representative of 
the complete circuit implementation. 

The term deviation refers to the effect of circuit faults on the result of a computation, and 
the deviation model is the bridge between the circuit characterization and the analysis of the 
algorithm. We reserve the term error for describing the algorithm, in the present case to refer to 
the incorrect detection of a transmitted symbol. A timing violation occurs in the circuit when the 
propagation delay between the input and output extends beyond a clock period. Modeling the 
deviations introduced by timing violations is challenging because they not only depend on the
current input to the circuit, but also on the state of the circuit before the new input was applied.
In general, timing violations also depend on other dynamic factors and on process variations.

Fig. 3. Computation tree of an LDPC decoder combined with the deviation model (for a regular LDPC code with $d_c = 4$ and
$d_v = 3$).

In this paper, we focus on the case where the output of the circuit is entirely determined 
by the current and previous inputs of the circuit, and by the nominal operating condition of 
the circuit. We denote by $\Gamma$ the set of possible operating conditions, represented by vectors of
parameters, and by $\gamma \in \Gamma$ a particular operating condition. For example, an operating condition
might specify the supply voltage and clock period used in the circuit. We assume that all the
parameters specified by $\gamma$ are deterministic.

B. Test Circuit 

The operation of an LDPC decoder can be represented using its one-iteration computation
tree, which models the generation of a VN-to-CN message in terms of messages sent in
the previous iteration. There are $(d_v - 1)$ check nodes in the tree. Each of these check nodes
receives $(d_c - 1)$ messages from neighboring variable nodes, and generates a message sent to
the one VN whose message was excluded from the computation. This VN then generates an
extrinsic message based on the channel prior $\mu_i^{(0)}$ and on the messages received from neighboring
check nodes. An example of a computation tree is shown within the dashed box in Fig. 3. For
convenience, we choose to measure deviations on an implementation of this computation tree, so
that measurements directly correspond with the progress made by the decoder in one iteration.
As discussed in more detail in Appendix B, the basic processing block of a row-layered decoder
handles the messages to and from one check node. The test circuit is therefore built by re-using
a basic block $d_v - 1$ times.
Since the test circuit is synchronous, we can represent it as a discrete-time system. Let $X_k$
be the input at clock cycle $k$. When timing violations are allowed to occur, the corresponding*
circuit output $Z_k$ can be expressed as $Z_k = g(X_k, S_k)$, where $S_k$ represents the state of the
circuit at the beginning of cycle $k$, and $g$ is some deterministic function. Equivalently, we can
write $\mu_{i,j}^{(t+1)} = g(\mu^{(t)}, S_k)$, where $\mu^{(t)}$ is a vector containing all the VN-to-CN messages that
form the input of the computation tree, and $\mu_{i,j}^{(t+1)}$ is a VN-to-CN message that will be sent in
the next iteration.

A sequence of message vectors $\mu^{(t)}$ can be mapped to a sequence of circuit inputs $X_k$ in
various ways. As is common with the type of decoder architecture considered here, we assume
that all processing circuits are re-used several times during the same iteration $t$ and layer $\ell$.
Therefore, for a fixed $(t, \ell)$ the sequence of circuit inputs $X_k$ forms an i.i.d. process. Since $S_k$
depends on input $X_{k-1}$ (and possibly also on other previous inputs), but not on $X_k$, $S_k$ and $X_k$
are independent. At the output, $Z_k$ and $Z_{k-1}$ are not independent, but it is possible to design the
architecture so that correlated outputs are not associated with the same Tanner graph nodes or 
with neighboring nodes. This occurs naturally in a row-layered architecture, since each variable 
node is only updated once in each layer. Therefore, it is sufficient to consider the marginal 
distribution of the circuit’s output, neglecting the correlation in successive outputs. 

C. Deviation Model 

We have seen above that a decoder message $\mu_{i,j}^{(t+1)}$ can be expressed as a function of the
messages $\mu^{(t)}$ received by the neighboring check nodes in the previous iteration and of the state
of the processing circuit. To separate the deviations from the ideal operation of the decoder, it is
helpful to decompose a decoding iteration into the ideal computation, followed by a transmission
through a deviation channel. This model is shown in Fig. 3, where $\nu_{i,j}^{(t+1)}$ is the message that
would be sent from variable node $i$ to check node $j$ during iteration $t + 1$ if no deviations had
occurred during iteration $t$. For the first messages sent in the decoder at $t = 0$, the computation
circuits are not used and therefore no deviation can occur, and we simply have $\mu_{i,j}^{(0)} = \nu_{i,j}^{(0)}$. Since
we neglect correlations in successive circuit outputs, the deviation channel is memoryless.

*The circuit could require one or several clock cycles to generate the first output, but this is irrelevant to the characterization 
of the computation. 
Unlike typical channel models where the noise is independent from other variables in the
system, the deviation of $\mu_{i,j}^{(t+1)}$ is a function of the current circuit input $X_k = \mu^{(t)}$ and of the current
state $S_k$. However, modeling deviations directly in terms of $\mu^{(t)}$ would make the model too
complex, because of the large dimensionality of the input. To simplify the model, we consider
only the value of the current output, and model deviations in terms of the conditional distribution
$\phi(\mu_{i,j}^{(t+1)} \mid \nu_{i,j}^{(t+1)})$, an approach that was also used in [14]. To improve the accuracy of the model,
it is also possible to consider the value of the transmitted bit $x_i$ associated with VN $i$.

Since the faulty messages depend on the circuit state, the deviation model is obtained by
averaging over the states $S_k$:

$$\phi(\mu_{i,j}^{(t+1)} \mid \nu_{i,j}^{(t+1)}) = \sum_{S_k} \phi(\mu_{i,j}^{(t+1)} \mid \nu_{i,j}^{(t+1)}, S_k)\, \phi(S_k) \qquad (3)$$

which in practice can be done using a Monte-Carlo simulation of the test circuit.
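The averaging over circuit states can be illustrated with a toy Monte-Carlo experiment. The sketch below does not simulate a real circuit: it uses a hypothetical stand-in in which the state is the previously latched output and a timing violation leaves that stale value in place with a fixed probability whenever the ideal output changes. All names and parameter values are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def estimate_deviation_model(ideal_outputs, fault_prob, rng):
    """Estimate the conditional distribution of faulty given ideal outputs."""
    counts = defaultdict(Counter)
    state = ideal_outputs[0]       # toy state: the previously latched output
    for nu in ideal_outputs:
        if nu != state and rng.random() < fault_prob:
            mu = state             # timing violation: the stale value is latched
        else:
            mu = nu                # the correct value is latched
        counts[nu][mu] += 1
        state = mu
    # Normalize the counts of each ideal value into a conditional distribution.
    return {nu: {mu: c / sum(row.values()) for mu, c in row.items()}
            for nu, row in counts.items()}

rng = random.Random(0)
ideal = [rng.randint(-3, 3) for _ in range(20000)]   # toy ideal messages
model = estimate_deviation_model(ideal, fault_prob=0.1, rng=rng)
print(sorted(model[0].items()))
```

Because the state is never recorded in `counts`, the resulting table is the marginal over states, which is the role played by the sum in the averaging step.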

D. Generalized Deviation Model 

When evaluating deviations based on (3), it is important to keep in mind that $S_k$ depends on
previous circuit inputs. Under the assumption that the previous use of the circuit belonged to the
same $(t, \ell)$, $\phi(S_k)$ is a function of $\phi(\mu_{i,j}^{(t)})$. As a result, the model described by (3) is only valid
for a fixed message distribution. Furthermore, because the message distribution depends on the
transmitted codeword, the deviation model also depends on the transmitted codeword.

Let us first assume that the transmitted codeword is fixed. In this case, the message distribution
$\phi(\mu_{i,j}^{(t)})$ depends on the channel noise, on the iteration index $t$, and on the operating condition
of the circuit. Since the messages are affected by deviations for $t > 0$, only $\phi(\mu_{i,j}^{(0)})$ is known
a priori. An obvious way to measure deviations is to perform a first evaluation of (3) using
the known $\phi(\mu_{i,j}^{(0)})$, and to repeat the process for each subsequent decoding iteration. However,
the resulting deviation model is of limited interest, since it depends on the specific message 
distributions in each iteration. 

To generate a model that is independent of the iterative progress of the decoder, we first
approximate $\phi(\mu_{i,j}^{(t)})$ as a 1-D Normal distribution with error rate parameter $p_e^{(t)}$ chosen such that



Note that while $\phi(\mu_{i,j}^{(0)})$ does correspond exactly to a 1-D Normal distribution, this is not
necessarily the case after the first iteration. This approximation is the price to pay to obtain
a standalone deviation model, but note that exact message distributions can still be used when
evaluating the performance of the faulty decoder. In fact, combining a density evolution based 
on exact distributions with a deviation model generated using 1-D Normal distributions leads to 
very accurate predictions in practice [27]. 

To construct the deviation model, we perform a number of Monte-Carlo simulations of (3)
using 1-D input distributions with various $p_e^{(t)}$ values. Interpolation is then used to obtain a
continuous model in $p_e^{(t)}$. The simulations are also performed for all operating conditions $\gamma \in \Gamma$.
We therefore obtain a model that consists of a family of conditional distributions, indexed by
$(p_e^{(t)}, \gamma)$, that we denote as

$$\phi^{(p_e^{(t)},\gamma)}(\mu_{i,j}^{(t+1)} \mid \nu_{i,j}^{(t+1)}). \qquad (5)$$

However, we generally omit the $(p_e^{(t)}, \gamma)$ superscript to simplify the notation. While measuring
deviations, we also record the switching activity in the circuit, which is then used to construct
an energy model that depends on $\gamma$ and $p_e^{(t)}$, denoted as $c_\gamma(p_e^{(t)})$ (where $c$ stands for “cost”).

To use the model, we first determine the error rate parameter $p_e^{(t)}$ corresponding to the
distribution of the messages $\mu_{i,j}^{(t)}$ at the beginning of iteration $t$, and we then retrieve the
appropriate conditional distribution, which also depends on the operating condition $\gamma$ of the
circuit. This conditional distribution then informs us of the statistics of deviations that occur at
the end of the iteration, that is on messages sent in the next iteration.
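The lookup step can be sketched as follows for a single operating condition. The text only states that interpolation is used; the linear interpolation between tabulated conditional distributions shown here is an assumption, as are the function name, the grid of error rates, and the 2x2 tables (rows indexed by the ideal value, columns by the faulty value).

```python
from bisect import bisect_left

def interpolate_model(pe_grid, tables, pe):
    """Linearly interpolate tabulated conditional distributions at error rate pe."""
    pe = min(max(pe, pe_grid[0]), pe_grid[-1])     # clamp to the measured range
    hi = bisect_left(pe_grid, pe)
    if hi == 0:
        return [row[:] for row in tables[0]]
    lo = hi - 1
    w = (pe - pe_grid[lo]) / (pe_grid[hi] - pe_grid[lo])
    return [[(1 - w) * a + w * b for a, b in zip(row_lo, row_hi)]
            for row_lo, row_hi in zip(tables[lo], tables[hi])]

# Hypothetical measurements at three error rates for one operating condition.
pe_grid = [0.01, 0.05, 0.10]
tables = [
    [[0.99, 0.01], [0.01, 0.99]],
    [[0.95, 0.05], [0.04, 0.96]],
    [[0.90, 0.10], [0.08, 0.92]],
]
table = interpolate_model(pe_grid, tables, 0.03)
print(table)
```

A convex combination of two conditional distributions is again a conditional distribution, so every row of the interpolated table still sums to one.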

As mentioned above, since $\phi(\mu_{i,j}^{(t)})$ depends on the transmitted codeword, this is also the
case of $\phi(S_k)$ and of the deviation distributions. We show in Section IV that the codeword
dependence is entirely contained within the deviation model and does not affect the analysis 
of the decoding performance, as long as the decoding algorithm and deviation model satisfy 
certain properties. Nonetheless, we would like to obtain a deviation model that does not depend 
on the transmitted codeword. This can be done when the objective is to predict the average 
performance of the decoder, rather than the performance for a particular codeword, since it is 
then sufficient to model the average behavior of the decoder. For the case where all codewords 
have an equal probability of being transmitted, we propose to perform the Monte-Carlo deviation 
measurements by randomly sampling transmitted codewords. This approach is supported by the 
experimental results presented in [27], which show that a deviation model constructed in this 
way can indeed accurately predict the average decoding performance. 
IV. Performance Analysis
A. Standard Analysis Methods for LDPC Decoders 

Density evolution (DE) is the most common tool used for predicting the error-correction
performance of an LDPC decoder. The analysis relies on the assumption that messages passed 
in the Tanner graph are mutually independent, which holds as the code length goes to infinity 
[28]. Given the channel output probability distribution and the probability distribution of variable 
node to check node messages at the start of an iteration, DE computes the updated distribution 
of variable node to check node messages at the end of the decoding iteration. This computation 
can be performed iteratively to determine the message distribution after any number of decoding 
iterations. The validity of the analysis rests on two properties of the LDPC decoder. The 
first property is the conditional independence of errors, which states that the error-correction 
performance of the decoder is independent from the particular codeword that was transmitted. 
The second property states that the error-correction performance of a particular LDPC code 
concentrates around the performance measured on a cycle-free graph, as the code length goes 
to infinity. 

Both properties were shown to hold in the context of reliable implementations [28]. It was 
also shown that the conditional independence of errors always holds when the channel is output 
symmetric and the decoder has a symmetry property. We can define a sufficient symmetry 
property of the decoder in terms of a message-update function $F_{i,j}$ that represents one complete
iteration of the (ideal) decoding algorithm. Given a vector $\mu^{(t)}$ of all the messages sent from
variable nodes to check nodes at the start of iteration $t$ and the channel information $\nu_i^{(0)}$ associated
with variable node $i$, $F_{i,j}$ returns the next ideal message to be sent from a variable node $i$ to a
check node $j$: $\nu_{i,j}^{(t+1)} = F_{i,j}(\mu^{(t)}, \nu_i^{(0)})$.


Definition 1. A message-update function $F_{i,j}$ is said to be symmetric with respect to a code $C$
if

$$x_i F_{i,j}(\mu^{(t)}, \nu_i^{(0)}) = F_{i,j}(x\mu^{(t)}, x_i\nu_i^{(0)})$$

for any $\mu^{(t)}$, any $\nu_i^{(0)}$, and any codeword $x \in C$.


In other words, a decoder’s message-update function is symmetric if multiplying all the VN-to-CN
belief messages sent at iteration $t$ and the belief priors by a valid codeword $x \in C$ is
equivalent to multiplying the next messages sent at iteration $t + 1$ by that same codeword. Note
that the symmetry condition in Definition 1 is implied by the check node and variable node 
symmetry conditions in [28, Def. 1]. 
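The symmetry property can be checked numerically on a small example. The sketch below uses a flooding-schedule Offset Min-Sum update on a toy (7,4) Hamming-type parity-check matrix, which is not regular and is used only because its codewords are easy to enumerate. The matrix, the function names, and the absence of message saturation are all assumptions made for illustration.

```python
import itertools
import random

# Toy parity-check matrix, used only for illustration.
H = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]
M, N = len(H), len(H[0])
CN_NEIGHBORS = [[i for i in range(N) if H[j][i]] for j in range(M)]
VN_NEIGHBORS = [[j for j in range(M) if H[j][i]] for i in range(N)]
EDGES = [(i, j) for j in range(M) for i in CN_NEIGHBORS[j]]

def codewords():
    """Enumerate all codewords in the {-1, +1} alphabet (bit 0 maps to +1)."""
    for bits in itertools.product((0, 1), repeat=N):
        if all(sum(H[j][i] * bits[i] for i in range(N)) % 2 == 0
               for j in range(M)):
            yield [1 - 2 * b for b in bits]

def sgn(v):
    return 1 if v >= 0 else -1

def cn_to_vn(mu, j, i, offset):
    """Offset min-sum message from CN j to VN i, excluding VN i's own input."""
    others = [mu[(k, j)] for k in CN_NEIGHBORS[j] if k != i]
    sign = 1
    for v in others:
        sign *= sgn(v)
    return sign * max(min(abs(v) for v in others) - offset, 0)

def update(mu, prior, i, j, offset):
    """Message-update function: next ideal message from VN i to CN j."""
    return prior + sum(cn_to_vn(mu, k, i, offset)
                       for k in VN_NEIGHBORS[i] if k != j)

def is_symmetric(trials=100, q_max=31, offset=1, seed=0):
    """Test the symmetry condition on random integer messages and priors."""
    rng = random.Random(seed)
    for x in codewords():
        for _ in range(trials):
            mu = {e: rng.randint(-q_max, q_max) for e in EDGES}
            priors = [rng.randint(-q_max, q_max) for _ in range(N)]
            x_mu = {(i, j): x[i] * v for (i, j), v in mu.items()}
            for (i, j) in EDGES:
                lhs = x[i] * update(mu, priors[i], i, j, offset)
                rhs = update(x_mu, x[i] * priors[i], i, j, offset)
                if lhs != rhs:
                    return False
    return True

print(is_symmetric())
```

The check relies on the fact that the product of the codeword symbols over the neighbors of any check node is $+1$, so flipping the inputs by $x$ flips each check node output by exactly $x_i$.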


B. Applicability of Density Evolution 

In order to use density evolution to predict the performance of long finite-length codes, the 
decoder must satisfy the two properties stated in Section IV-A, namely the conditional independence
of errors and the convergence to the cycle-free case. We first present some properties of
the decoding algorithm and of the deviation model that are sufficient to ensure the conditional 
independence of errors. 

Using the multiplicative description of the BIAWGN channel, the vector received by the
decoder is given by $y = xz$ when a codeword $x$ is transmitted, or by $y = z$ when the all-one
codeword is transmitted. In a reliable decoder, messages are completely determined by the
received vector, but in a faulty decoder, there is additional randomness that results from the
deviations. Therefore, we represent messages in terms of conditional probability distributions
given $xz$. Since we are concerned with a fixed-point circuit implementation of the decoder, we
can assume that messages are integers from the set $\{-Q, -Q + 1, \ldots, Q\}$, where $Q > 0$ is the
largest message magnitude that can be represented.


Definition 2. We say that a message distribution $\phi_{\mu_{i,j}|y}(\mu \mid xz)$ is symmetric if

$$\phi_{\mu_{i,j}|y}(\mu \mid xz) = \phi_{\mu_{i,j}|y}(x_i\mu \mid z).$$


If a message has a symmetric distribution, its error probability as defined in (4) is the same
whether $\mathbf{x}\mathbf{z}$ or $\mathbf{z}$ is received. Similarly to the results presented in [14], we can show
that the symmetry of message distributions is preserved when the message-update function is
symmetric.

Lemma 1. If $F_{i,j}$ is a symmetric message-update function and if $\mu_{i,j}^{(t)}$ and $\nu_i^{(0)}$ have
symmetric distributions for all $(i, j)$, the next ideal messages $\nu_{i,j}^{(t+1)}$ also have symmetric
distributions.


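Proof: We can express the distribution of the next ideal message from VN $i$ to CN $j$ as

$$\phi_{\nu_{i,j}^{(t+1)}|\mathbf{y}}(\nu \mid \mathbf{x}\mathbf{z}) = \sum_{(\mu, \nu_i) \in R} \phi_{\mu^{(t)}|\mathbf{y}}(\mu \mid \mathbf{x}\mathbf{z}) \, \phi_{\nu_i^{(0)}|\mathbf{y}}(\nu_i \mid \mathbf{x}\mathbf{z}), \qquad (6)$$

where $R = \{(\mu, \nu_i) : F_{i,j}(\mu, \nu_i) = \nu\}$.

Assuming that the elements of the VN-to-CN message vector $\mu^{(t)}$ are independent and that
each $\mu_{i,j}$ has a symmetric distribution,

$$\phi_{\mu^{(t)}|\mathbf{y}}(\mu \mid \mathbf{x}\mathbf{z}) = \prod_k \phi_{\mu_k^{(t)}|\mathbf{y}}(\mu_k \mid \mathbf{x}\mathbf{z}) = \prod_k \phi_{\mu_k^{(t)}|\mathbf{y}}(x_k \mu_k \mid \mathbf{z}) = \phi_{\mu^{(t)}|\mathbf{y}}(\mathbf{x}\mu \mid \mathbf{z}),$$

where $k$ indexes the elements of the message vector and $x_k$ is the codeword bit of the VN that
sends $\mu_k$, and since the channel output also has a symmetric distribution,

$$\phi_{\nu_i^{(0)}|\mathbf{y}}(\nu_i \mid \mathbf{x}\mathbf{z}) = \phi_{\nu_i^{(0)}|\mathbf{y}}(x_i \nu_i \mid \mathbf{z}).$$

Therefore, we can rewrite (6) as

$$\phi_{\nu_{i,j}^{(t+1)}|\mathbf{y}}(\nu \mid \mathbf{x}\mathbf{z}) = \sum_{(\mu, \nu_i) \in R} \phi_{\mu^{(t)}|\mathbf{y}}(\mathbf{x}\mu \mid \mathbf{z}) \, \phi_{\nu_i^{(0)}|\mathbf{y}}(x_i \nu_i \mid \mathbf{z}). \qquad (7)$$

Finally, letting $\mu' = \mathbf{x}\mu$ and $\nu_i' = x_i \nu_i$, (7) becomes

$$\phi_{\nu_{i,j}^{(t+1)}|\mathbf{y}}(\nu \mid \mathbf{x}\mathbf{z}) = \sum_{(\mu', \nu_i') \in R'} \phi_{\mu^{(t)}|\mathbf{y}}(\mu' \mid \mathbf{z}) \, \phi_{\nu_i^{(0)}|\mathbf{y}}(\nu_i' \mid \mathbf{z}),$$

where $R' = \{(\mu', \nu_i') : F_{i,j}(\mathbf{x}\mu', x_i \nu_i') = \nu\}$. Since $F_{i,j}$ is symmetric, we can also
express $R'$ as

$$R' = \{(\mu', \nu_i') : F_{i,j}(\mu', \nu_i') = x_i \nu\},$$

and therefore,

$$\phi_{\nu_{i,j}^{(t+1)}|\mathbf{y}}(\nu \mid \mathbf{x}\mathbf{z}) = \sum_{(\mu', \nu_i') \in R'} \phi_{\mu^{(t)}|\mathbf{y}}(\mu' \mid \mathbf{z}) \, \phi_{\nu_i^{(0)}|\mathbf{y}}(\nu_i' \mid \mathbf{z}) = \phi_{\nu_{i,j}^{(t+1)}|\mathbf{y}}(x_i \nu \mid \mathbf{z}),$$

indicating that the next ideal messages have symmetric distributions. ■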

To establish the conditional independence of errors under the proposed deviation model, we
first define some properties of the deviation.

Note that if the model satisfies the symmetry condition, it also satisfies the weak symmetry
condition, since $x_i \in \{-1, 1\}$. We then have the following lemma.

Lemma 2. If a decoder having a symmetric message-update function and taking its inputs
from an output-symmetric communication channel is affected by weakly symmetric deviations,
its message error probability at any iteration $t \ge 0$ is independent of the transmitted codeword.


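Proof: Similarly to the approach used in [29, Lemma 4.90] and [9], we want to show that
the probability that messages are in error is the same whether $\mathbf{x}\mathbf{z}$ or $\mathbf{z}$ is received. This is
the case if the faulty messages $\mu_{i,j}^{(t)}$ have a symmetric distribution for all $t \ge 0$ and all $(i, j)$.

Since the communication channel is output symmetric and since no deviations can occur
before the first iteration, messages $\mu_{i,j}^{(0)}$ have a symmetric distribution. We proceed by
induction to establish the symmetry of the messages for $t > 0$. We start by assuming that

$$\phi_{\nu_{i,j}^{(t)}|\mathbf{y}}(\nu \mid \mathbf{x}\mathbf{z}) = \phi_{\nu_{i,j}^{(t)}|\mathbf{y}}(x_i \nu \mid \mathbf{z}) \qquad (8)$$

also holds for $t > 0$.

Using Definition 4 and (8), we can write the faulty message distribution as

$$\begin{aligned}
\phi_{\mu_{i,j}^{(t)}|\mathbf{y}}(\mu \mid \mathbf{x}\mathbf{z})
&= \sum_{\nu=-Q}^{Q} \phi_{\mu_{i,j}^{(t)}|\nu_{i,j}^{(t)},x_i}(\mu \mid \nu, x_i) \, \phi_{\nu_{i,j}^{(t)}|\mathbf{y}}(\nu \mid \mathbf{x}\mathbf{z}) \\
&= \sum_{\nu=-Q}^{Q} \phi_{\mu_{i,j}^{(t)}|\nu_{i,j}^{(t)},x_i}(x_i \mu \mid x_i \nu, 1) \, \phi_{\nu_{i,j}^{(t)}|\mathbf{y}}(x_i \nu \mid \mathbf{z}) \\
&= \sum_{\nu'=-x_i Q}^{x_i Q} \phi_{\mu_{i,j}^{(t)}|\nu_{i,j}^{(t)},x_i}(x_i \mu \mid \nu', 1) \, \phi_{\nu_{i,j}^{(t)}|\mathbf{y}}(\nu' \mid \mathbf{z}) \\
&= \sum_{\nu'=-Q}^{Q} \phi_{\mu_{i,j}^{(t)}|\nu_{i,j}^{(t)},x_i}(x_i \mu \mid \nu', 1) \, \phi_{\nu_{i,j}^{(t)}|\mathbf{y}}(\nu' \mid \mathbf{z}) \\
&= \phi_{\mu_{i,j}^{(t)}|\mathbf{y}}(x_i \mu \mid \mathbf{z}),
\end{aligned}$$

where the third equality is obtained using the substitution $\nu' = x_i \nu$. We conclude that the faulty
messages have a symmetric distribution. Finally, since the decoder’s message-update function is
symmetric, Lemma 1 confirms the induction hypothesis in (8). ■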
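The mechanism used in this proof can be checked numerically on a toy example. The sketch below is only an illustration under stated assumptions: it works with marginal distributions (averaged over the channel noise), takes weak symmetry to mean $P(\mu \mid \nu, -1) = P(-\mu \mid -\nu, 1)$, and uses made-up probability values and function names.

```python
def faulty_distribution(ideal, deviation):
    """Mix an ideal-message pmf through a conditional deviation pmf.

    ideal[nu] is P(nu); deviation[nu][mu] is P(mu | nu).
    """
    faulty = {}
    for nu, p_nu in ideal.items():
        for mu, p_mu in deviation[nu].items():
            faulty[mu] = faulty.get(mu, 0.0) + p_mu * p_nu
    return faulty


def mirror(pmf):
    """Return the pmf of the sign-flipped random variable."""
    return {-value: prob for value, prob in pmf.items()}


# Hypothetical ideal-message pmf when x_i = +1, with messages in {-1, 0, 1}.
ideal_pos = {-1: 0.1, 0: 0.2, 1: 0.7}
# Hypothetical deviation model P(mu | nu, x_i = +1): incorrect beliefs
# (nu < 0) deviate more often than correct ones, so it is not symmetric.
dev_pos = {
    -1: {-1: 0.8, 0: 0.2},
    0: {0: 1.0},
    1: {1: 0.95, 0: 0.05},
}
# Weak symmetry (assumed form): P(mu | nu, -1) = P(-mu | -nu, +1).
dev_neg = {-nu: mirror(row) for nu, row in dev_pos.items()}
# Symmetric ideal messages: P(nu | x_i = -1) = P(-nu | x_i = +1).
ideal_neg = mirror(ideal_pos)

faulty_pos = faulty_distribution(ideal_pos, dev_pos)
faulty_neg = faulty_distribution(ideal_neg, dev_neg)

# The faulty pmf for x_i = -1 is the mirror image of the one for x_i = +1.
for mu, prob in faulty_pos.items():
    print(mu, prob, faulty_neg[-mu])
```

In this toy case the error probability is therefore the same for both values of $x_i$, even though the deviation model treats correct and incorrect beliefs differently.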

The last remaining step in establishing whether density evolution can be used with a decoder
affected by WS deviations is to determine whether the error-correction performance of a code
concentrates around the cycle-free case. The property has been shown to hold in [9] (Theorems
2, 3 and 4) for an LDPC decoder affected by “wire noise” and “computation noise”. The wire
noise model is similar to our deviation model, in the sense that the messages are passed through 
an additive noise channel, and that the noise applied to one message is independent of the 
noise applied to other messages. The proof presented in [9] only relies on the fact that the wire 
noise applied to a given message can only affect messages that are included in the directed 
neighborhood of the edge where it is applied, where the graph direction refers to the direction 
of message propagation. This clearly also holds in the case of our deviation model, and therefore 
the proof is the same. 

Since the message error probability is independent of the transmitted codeword, and
furthermore concentrates around the cycle-free case, density evolution can be used to determine
the error-correction performance of a decoder perturbed by our deviation model, as long as the
deviations are weakly symmetric.

C. Deviation Examples 

As described in Section III-D, we collect deviation measurements from the test circuits by
inputting test vectors representing random codewords, and distributed according to several
$p_e^{(t)}$ values. We then generate estimates of the conditional distributions in (5). It is interesting
to visualize the distributions using an aggregate measure such as the probability of observing a
non-zero deviation, denoted $p_{nz}(\nu_{i,j}, x_i)$.



These conditional probabilities are shown for a (3, 30) circuit in Fig. 4. When $x_i = 1$, positive
belief values indicate a correct decision, whereas when $x_i = -1$, negative belief values indicate
a correct decision. We can see that in this example, deviations are more likely when the belief
is incorrect than when it is correct, and therefore a symmetric deviation model is not consistent
with these measurements. On the other hand, there is a sign symmetry between the “correct”
parts of the curves, and between the “incorrect” parts, that is $p_{nz}(\nu_{i,j}, 1) = p_{nz}(-\nu_{i,j}, -1)$,
and for this reason a weakly symmetric model is consistent with the measurements. Note that the
slight jaggedness observed for incorrect belief values of large magnitude in the $p_e^{(t)} = 0.008$
curves is due to the fact that these $\nu_{i,j}$ values occur only rarely. For the largest incorrect $\nu_{i,j}$
values, only about 100 deviation events are observed for each point, despite the large number
of Monte-Carlo (MC) trials.

Fig. 4. Non-zero deviation probability given $\nu_{i,j}$ and $x_i$ at two $p_e^{(t)}$ values, measured on a
(3,30) circuit operated at $V_{dd}$ = 0.75 V and $T_{clk}$ = 3.2 ns. 3 · 10® decoding iteration trials were
performed for each $p_e^{(t)}$ value. The total number of non-zero deviation events observed is
4,115,229 at $p_e^{(t)}$ = 0.015, and 10,071,810 at $p_e^{(t)}$ = 0.008.

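Figure 5 shows a similar plot for a (3,6) circuit. In this case, the non-zero deviation
probability is approximately the same for correct and incorrect beliefs, and a symmetric deviation
model could be appropriate. Of course, since it is more general, a WS model is also appropriate.

Under the assumption that deviations are weakly symmetric, we have

$$\phi_{\mu_{i,j}^{(t)}|\nu_{i,j}^{(t)},x_i}(\mu \mid \nu, -1) = \phi_{\mu_{i,j}^{(t)}|\nu_{i,j}^{(t)},x_i}(-\mu \mid -\nu, 1).$$

Therefore, we can combine the $x_i = 1$ and $x_i = -1$ data to improve the accuracy of the estimated
distributions.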

Fig. 5. Non-zero deviation probability given $\nu_{i,j}$ and $x_i$ at two $p_e^{(t)}$ values, measured on a
(3,6) circuit operated at $V_{dd}$ = 0.85 V and $T_{clk}$ = 2.1 ns. 3 · 10® decoding iteration trials were
performed for each $p_e^{(t)}$ value. The total number of non-zero deviation events observed is
2,524,601 at $p_e^{(t)}$ = 0.09, and 1,020,867 at $p_e^{(t)}$ = 0.05.

Let $p_L$ and $p_H$ be respectively the smallest and largest $p_e^{(t)}$ values for which the deviations
have been characterized. We can generate a conditional distribution for any $p_e^{(t)} \in [p_L, p_H]$ by
interpolating from the nearest distributions that have been measured. We choose $p_H > p_e^{(0)}$ to
make sure that the first iteration’s deviation is within the characterized range. Because messages
in the decoder are saturated once they reach the largest magnitude that can be represented, the
circuit’s switching activity decreases when the message error probability becomes very small.
Since timing faults cannot occur when the circuit does not switch, we can expect deviations
to be equally or less likely at $p_e^{(t)}$ values below $p_L$. Therefore, to define the deviation model
for $p_e^{(t)} < p_L$, we make the pessimistic assumption that the deviation distribution remains the
same as for $p_e^{(t)} = p_L$.
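The interpolation and the pessimistic extension below $p_L$ can be sketched as follows. The text does not specify the interpolation rule, so linear interpolation between the two nearest measured points is an assumption here, as are the function name and the data layout.

```python
from bisect import bisect_right


def interpolate_deviation(p_e, measured):
    """Return a deviation pmf for error probability p_e.

    measured: list of (p_e_value, pmf) pairs sorted by increasing p_e_value,
    where each pmf maps a deviation value to its probability.
    Linear interpolation is an assumption of this sketch.
    """
    points = [p for p, _ in measured]
    p_low, p_high = points[0], points[-1]
    if p_e <= p_low:
        # Pessimistic assumption: reuse the distribution measured at p_L.
        return dict(measured[0][1])
    if p_e > p_high:
        raise ValueError("p_e is above the characterized range")
    hi = min(bisect_right(points, p_e), len(points) - 1)
    lo = hi - 1
    weight = (p_e - points[lo]) / (points[hi] - points[lo])
    pmf_lo, pmf_hi = measured[lo][1], measured[hi][1]
    keys = set(pmf_lo) | set(pmf_hi)
    return {k: (1.0 - weight) * pmf_lo.get(k, 0.0) + weight * pmf_hi.get(k, 0.0)
            for k in keys}


# Hypothetical measurements at two characterized error probabilities.
measured = [
    (0.008, {0: 0.90, 1: 0.10}),
    (0.015, {0: 0.80, 1: 0.20}),
]
print(interpolate_deviation(0.0115, measured))
print(interpolate_deviation(0.001, measured))
```

Because each measured pmf sums to one and the weights are convex, the interpolated pmf also sums to one.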

D. DE and Energy Curves 

We evaluate the progress of the decoder affected by timing violations using quantized density
evolution [30]. For the Offset Min-Sum algorithm, a DE iteration can be split into the following
steps: 1-a) evaluating the distribution of the CN minimum, 1-b) evaluating the distribution of
the CN output, after subtracting the offset, 2) evaluating the distribution of the ideal VN-to-CN
message, and 3) evaluating the distribution of the faulty VN-to-CN messages. Step 1-a is given in
[15], while the others are straightforward. In the context of DE, we write the message distribution
as $\pi^{(t)} = \phi(\mu_{i,j}^{(t)} \mid x_i = 1)$, and the channel output distribution as
$\pi^{(0)} = \phi(\nu_i^{(0)} \mid x_i = 1)$. We write a DE iteration as $\pi^{(t+1)} = f_\gamma(\pi^{(t)}, \pi^{(0)})$.
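As an illustration of step 2 and of the projection onto the error probability, the following sketch computes the distribution of an ideal VN-to-CN message on integer messages. It assumes the usual variable node rule (channel prior plus $d_v - 1$ incoming CN messages), saturation after every addition, and an error probability that counts half of the probability mass at zero; these are assumptions of the sketch, since the text does not spell them out, and all names and values are hypothetical.

```python
def saturate(value, q):
    """Clip an integer message to the representable range [-q, q]."""
    return max(-q, min(q, value))


def convolve_saturated(pmf_a, pmf_b, q):
    """pmf of saturate(a + b) for independent a and b."""
    out = {}
    for a, pa in pmf_a.items():
        for b, pb in pmf_b.items():
            s = saturate(a + b, q)
            out[s] = out.get(s, 0.0) + pa * pb
    return out


def vn_output_pmf(channel_pmf, cn_pmf, d_v, q):
    """Ideal VN-to-CN message pmf: prior plus d_v - 1 CN-to-VN messages."""
    pmf = dict(channel_pmf)
    for _ in range(d_v - 1):
        pmf = convolve_saturated(pmf, cn_pmf, q)
    return pmf


def error_probability(pmf):
    """Projection onto the error probability, given x_i = +1 (assumed form)."""
    return sum(p for v, p in pmf.items() if v < 0) + 0.5 * pmf.get(0, 0.0)


# Hypothetical two-point distributions, conditioned on x_i = +1.
channel = {1: 0.9, -1: 0.1}
cn_out = {1: 0.8, -1: 0.2}
ideal = vn_output_pmf(channel, cn_out, d_v=3, q=2)
print(ideal, error_probability(ideal))
```

Step 3 would then pass `ideal` through the conditional deviation distribution to obtain the faulty message distribution.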

As mentioned in Section III-D, the energy consumption is modeled in terms of the message
error probability and of the operating condition, and denoted $c_\gamma(p_e^{(t)})$. As for the deviation
model, we use interpolation to define $c_\gamma(p_e^{(t)})$ for $p_e^{(t)} \in [p_L, p_H]$, and assume that
$c_\gamma(p_e^{(t)}) = c_\gamma(p_L)$ for $p_e^{(t)} < p_L$. To display the DE iteration $f_\gamma$ and $c_\gamma(p_e^{(t)})$ on the
same plot, we project the message distributions onto the message error probability space.

Fig. 6. Examples of projected DE curves (solid lines) and energy curves (dashed lines) for rate 0.5
ensembles with $d_v \in \{3, 4, 5\}$, and $p_e^{(0)}$ = 0.09.

Fig. 7. Examples of projected DE curves (solid lines) and energy curves (dashed lines) for the
(3, 30) and (4,40) ensembles (rate 0.9), with $p_e^{(0)}$ = 0.015.


Several regular code ensembles were evaluated, with rates 1/2 and 9/10. Fig. 6 shows examples
of projected DE curves and energy curves for rate-1/2 code ensembles with $d_v \in \{3, 4, 5\}$ and
various operating conditions. The energy is measured as described in Appendix A and corresponds
to one use of the test circuit (shown in Fig. 8). The nominal operating condition is $V_{dd}$ = 1.0 V,
$T_{clk}$ = 2.0 ns and therefore these curves correspond to a reliable implementation. With a reliable
implementation, these ensembles have a channel threshold of $p_e^{(0)} < 0.12$ for the (3,6) ensemble,
$p_e^{(0)} < 0.11$ for (4, 8), and $p_e^{(0)} < 0.09$ for (5,10). We use $p_e^{(0)} = 0.09$ for all the curves shown
in Fig. 6 to allow comparing the ensembles. As can be expected, a larger variable node degree
results in faster convergence towards zero error rate, and it is natural to ask whether this property
might provide greater fault tolerance and ultimately better energy efficiency. This is discussed
in Section V-D.

Fig. 7 is a similar plot for the (3, 30) and (4,40) ensembles. The channel threshold of both
ensembles is approximately $p_e^{(0)} < 0.019$. For these curves, the nominal operating condition is
$V_{dd}$ = 1.0 V and $T_{clk}$ = 3 ns. As we can see, the energy consumption per iteration of the (4,40)
decoder is roughly double that of the (3,30) decoder. We note that in the case of the (3,30) 
ensemble, the reliable decoder stops making progress at an error probability of approximately 
10“®. This floor is the result of the message saturation limit chosen for the circuit. 

V. Energy Optimization 

A. Design Parameters 

As in a standard LDPC code-decoder design, the first parameter to be optimized is the choice
of code ensemble. In this paper we restrict the discussion to regular codes, and therefore we
need only to choose a degree pair $(d_v, d_c)$, where $R = 1 - d_v/d_c$ is the design rate of the code.
For a fixed $R$, we can observe that both the energy consumption and the circuit area of the
decoding circuit grow rapidly with $d_v$, and therefore it is only necessary to consider a few of
the lowest $d_v$ values.

Besides the choice of ensemble, we are interested in finding the optimal choice of operating
parameters for the quasi-synchronous circuit. We consider here the supply voltage ($V_{dd}$) and the
clock period ($T_{clk}$). Generally speaking, the supply voltage affects the energy consumption, while
the clock period affects the decoding time, or latency. The energy and latency are also affected
by the choice of code ensemble, since the number of operations to be performed depends on the
node degrees. The operating parameters of a decoder are denoted as a vector $\gamma = [V_{dd}, T_{clk}]$.

The decoding of LDPC codes proceeds in an iterative fashion, and it is therefore possible
to adjust the operating parameters on an iteration-by-iteration basis. In practice, this could be
implemented in various ways, for example by using a pipelined sequence of decoder circuits,
where each decoder is responsible for only a portion of the decoding iterations. It is also possible
to rapidly vary the clock frequency of a given circuit by using a digital clock divider circuit [31].
We denote by $\boldsymbol{\gamma}$ the sequence of parameters used at each iteration throughout the decoding, and
we use $\boldsymbol{\gamma} = [\gamma_1^{N_1}, \gamma_2^{N_2}, \ldots]$ to denote a specific sequence in which the parameter vector
$\gamma_1$ is used for the first $N_1$ iterations, followed by $\gamma_2$ for the next $N_2$ iterations, and so on.
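A parameter sequence of this form maps naturally onto a list of (parameter vector, repetition count) pairs. The sketch below expands such a list into one entry per iteration and computes its total latency, assuming that each decoding iteration costs one clock period of the test circuit; that assumption, the function names, and the example counts are all hypothetical.

```python
def expand_schedule(schedule):
    """Expand [((v_dd, t_clk), n), ...] into one (v_dd, t_clk) per iteration."""
    per_iteration = []
    for params, count in schedule:
        per_iteration.extend([params] * count)
    return per_iteration


def schedule_latency(schedule):
    """Total latency in ns, assuming one clock period per iteration."""
    return sum(t_clk * count for (_v_dd, t_clk), count in schedule)


# Hypothetical schedule: three iterations at a reduced supply voltage,
# followed by one iteration at the nominal supply voltage.
schedule = [((0.8, 2.5), 3), ((1.0, 2.5), 1)]
print(expand_schedule(schedule))
print(schedule_latency(schedule))
```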

B. Objective 

The performance of the LDPC code and of its decoder can be described by specifying a vector
$P = (p_e^{(0)}, p_{res}, T_{dec})$, where $p_e^{(0)}$ is the output error rate of the communication channel,
$p_{res}$ the residual error rate of VN-to-CN messages when the decoder terminates, and $T_{dec}$ the
expected decoding latency.

The decoder’s performance $P$ and energy consumption $E$ are controlled by $\boldsymbol{\gamma}$. The energy
minimization problem can be stated as follows. Given a performance constraint $P = (a, b, c)$,
we wish to find the value of $\boldsymbol{\gamma}$ that minimizes $E$, subject to $p_e^{(0)} > a$, $p_{res} < b$, $T_{dec} < c$. As
in the standard DE method, we propose to use the code’s computation tree as a proxy for the
entire decoder, and furthermore to use the energy consumption of the test circuit described in
Appendix B as the optimization objective. To be able to replace the energy minimization of
the complete decoder with the energy minimization of the test circuit, we make the following
assumptions:

1) The ordering of the energy consumption is the same for the test circuit and for the complete
decoder, that is, for any $\gamma_1$ and $\gamma_2$, $E_{test}(\gamma_1) < E_{test}(\gamma_2)$ implies $E_{dec}(\gamma_1) < E_{dec}(\gamma_2)$,
where $E_{test}(\gamma)$ and $E_{dec}(\gamma)$ are respectively the energy consumption of the test circuit
and of the complete decoder when using parameter $\gamma$.

2) The average message error rate in the test circuit and in the complete decoder is the same
for all decoding iterations.

3) The latency of the complete decoder is proportional to the latency of the test circuit, that
is, if $T_{dec}(\gamma)$ is the latency measured using the test circuit with parameter $\gamma$, the latency
of the complete decoder is given by $\beta T_{dec}(\gamma)$, where $\beta$ does not depend on $\gamma$.

Assumption 1 is reasonable because the test circuit is very similar to a computation unit used
in the complete decoder. The difference between the two is that the test circuit only instantiates
one full VNP, the remaining $(d_c - 1)$ VNPs being reduced to only their “front” part (as seen
in Fig. 8), whereas the complete decoder has $d_c$ full VNPs for every CNP. Assumption 2 is the
standard DE assumption, which is reasonable for sufficiently long codes. Finally, regarding
Assumption 3, it is possible for the clock period to be slower in the complete decoder, because
the increased area could result in longer interconnections between circuit blocks. Even if this is
the case, the interconnect length only depends on the area of the complete decoder, which is not
affected by the parameters we are optimizing, and hence $\beta$ does not depend on $\gamma$.

Clearly, if Assumption 1 holds and the performance of the test circuit is the same as the
performance of the complete decoder, then the solution of the energy minimization is also the
same. The performance is composed of the three components $(p_e^{(0)}, p_{res}, T_{dec})$. The channel error
rate $p_e^{(0)}$ does not depend on the decoder and is clearly the same in both cases. Because of
Assumption 2, the complete decoder can achieve the same residual error rate as the test circuit
when $p_e^{(0)}$ is the same. The latencies measured on the test circuit and on the complete decoder
are not necessarily the same, but if Assumption 3 holds, and if we assume that the constant $\beta$
is known, then we can find the solution to the energy minimization of the complete decoder
subject to constraints $(p_e^{(0)}, p_{res}, T_{dec})$ by instead minimizing the energy of the test circuit with
constraints $(p_e^{(0)}, p_{res}, T_{dec}/\beta)$.

We also consider another interesting optimization problem. It is well known that for a fixed
degree of parallelism, energy consumption is proportional to processing speed (represented here
by $T_{dec}$), which is observed both in the physical energy limit stemming from Heisenberg’s
uncertainty principle [32], as well as in practical CMOS circuits [33]. In situations where both
throughput normalized to area and low energy consumption are desired, optimizing the product
of energy and latency or energy-delay product (EDP) for a fixed circuit area can be a better
objective. In that case the performance constraint is stated in terms of $P = (p_e^{(0)}, p_{res})$, and
the optimization problem becomes the following: given a performance constraint $P = (a, b)$,
minimize $E(\boldsymbol{\gamma}) \cdot T_{dec}(\boldsymbol{\gamma})$ subject to $p_e^{(0)} > a$, $p_{res} < b$, and a fixed circuit area.

C. Dynamic Programming 

To solve the iteration-by-iteration energy and EDP minimization problems stated above, we
adapt the “Gear-Shift” dynamic programming approach proposed in [20]. The original method
relies on the fact that the message distribution has a 1-D characterization, which is chosen to be
the error probability. By quantizing the error probability space, a trellis graph can be constructed
in which each node is associated with a pair $(\tilde{p}_e, t)$. Quantized quantities are marked with tildes.

TABLE I
Energy and EDP optimization results.

                                                       Standard           Quasi-synchronous
Code     Nom.     Norm.                      Latency   Energy   EDP       Best energy   Best EDP
family   T_clk    area †   p_e^(0)  p_res    [ns]      [pJ]     [nJ·ns]   [pJ]          [nJ·ns]
(3,6)    2.0 ns   1.066    0.12*    ≤ 10^-8  66        250      16.5      192 (-23%)    12.7 (-23%)
                           0.09     ≤ 10^-8  22        68.2     1.50      45.0 (-34%)   0.98 (-35%)
(4,8)    2.0 ns   1.44     0.09     ≤ 10^-8  18        98.5     1.77      74.9 (-24%)   1.33 (-25%)
(3,30)   3.0 ns   1.099    0.019*   ≤ 10^-8  84.0      883      74.2      605 (-31%)    48.6 (-35%)
         2.5 ns   1.135    0.019*   ≤ 10^-8  70.0      916      64.1      664 (-28%)    46.5 (-27%)
         3.0 ns   1.099    0.015    ≤ 10^-8  39.0      306      11.9      196 (-36%)    7.35 (-38%)
         2.5 ns   1.135    0.015    ≤ 10^-8  32.5      324      10.5      214 (-34%)    6.92 (-34%)
(4,40)   3.0 ns   1.522    0.015    ≤ 10^-8  27.0      364      9.83      224 (-38%)    5.93 (-40%)

† Cell area divided by the minimal area of the smallest decoder having the same code rate.
* Approx. threshold.


A particular choice of $\boldsymbol{\gamma}$ corresponds to a path $P$ through the graph, and the optimization is
transformed into finding the least expensive path that starts from the initial state $(\tilde{p}_e^{(0)}, 0)$ and
reaches any state $(\tilde{p}_e, t)$ such that $\tilde{p}_e < p_{res}$ and the latency constraint is satisfied, if there is
one. Note that to ensure that the solutions remain achievable in the original continuous space,
the message error rates $p_e^{(t)}$ are quantized by rounding up. To maintain a good resolution at low
error rates, we use a logarithmic quantization, with 1000 points per decade.
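A logarithmic quantizer that rounds up can be written in a few lines. This is a minimal sketch with a hypothetical function name; it places the grid points at $10^{k/1000}$ for integer $k$, which is one natural reading of "1000 points per decade".

```python
import math

POINTS_PER_DECADE = 1000


def quantize_up(p_e, points_per_decade=POINTS_PER_DECADE):
    """Round an error probability up to the next point of a logarithmic grid."""
    if not 0.0 < p_e <= 1.0:
        raise ValueError("p_e must be in (0, 1]")
    index = math.ceil(math.log10(p_e) * points_per_decade)
    quantized = 10.0 ** (index / points_per_decade)
    # Guard against floating-point rounding so the result is never below p_e.
    return max(quantized, p_e)


p = 0.09
q = quantize_up(p)
print(q, q / p)
```

Rounding up means the quantized state is never more optimistic than the true error rate, which is what keeps a path found on the grid achievable in the continuous space.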

In the case of a faulty decoder, we want to evaluate the decoder’s progress by tracking a
complete message distribution using DE, rather than simply tracking the message error
probability. In this case, the Gear-Shift method can be used as an approximate solver by projecting
the message distribution $\pi^{(t)} = \phi(\mu_{i,j}^{(t)} \mid x_i = 1)$ onto the error probability space. We refer to
this method as DE-Gear-Shift. Any path through the graph is evaluated by performing DE on the
entire path using exact distributions, but different paths are compared in the projection space. As
a result, the solutions that are found are not guaranteed to be optimal, but they are guaranteed
to accurately represent the progress of the decoder.

In the DE-Gear-Shift method, a path $P$ is a sequence of states $\pi^{(t)}$. As in the original
Gear-Shift method, any sequence of decoder parameters $\boldsymbol{\gamma}$ corresponds to a path. We denote
the projection of a state $\pi^{(t)}$ onto the error probability space as $\Theta(\pi^{(t)})$. To each path $P$, we
associate an energy cost $E_P$ and a latency cost $T_P$. A path ending at a state $\pi^{(t)}$ can be extended
with one additional decoding iteration using parameter $\gamma$ by evaluating one DE iteration to obtain
$\pi^{(t+1)} = f_\gamma(\pi^{(t)}, \pi^{(0)})$. Performing this additional iteration adds an energy cost given by the
energy model $c_\gamma$ and a latency cost $T_\gamma$ to the path’s cost. When optimizing EDP, we define the
overall cost of a path $C_P$ as $C_P = E_P \cdot T_P$. When optimizing energy under a latency constraint,
we define the path cost as a two-dimensional vector $C_P = (E_P, T_P)$.

We use the following rules to discard paths that are suboptimal in the error probability space.

Rule 1: Paths for which the message error rate is not monotonically decreasing are discarded.

Rule 2: A path $P$ with cost $C_P$ is said to dominate another path $P'$ with cost $C_{P'}$ if all the
following conditions hold: 1) an ordering exists between $C_P$ and $C_{P'}$, 2) $C_P < C_{P'}$,
3) $\Theta(\pi_P) < \Theta(\pi_{P'})$, where $\pi_P$ denotes the last state reached by path $P$.

The search for the least expensive path is performed breadth-first. After each traversal of the
graph, any path that is dominated by another is discarded.

When the path cost is one-dimensional, the optimization requires evaluating $O(|\Gamma| N_\delta)$ DE
iterations, where $|\Gamma|$ is the number of operating points being considered and $N_\delta$ the number of
quantization levels used for $\tilde{p}_e$. This can be seen from the fact that with a 1-D cost, Rule 2
implies that at most one path can reach a given state $\tilde{p}_e$. Therefore, $O(|\Gamma| N_\delta)$ DE iterations are
required for each decoding iteration. In addition, upper bounds can be derived for the number of
decoding iterations spanned by the trellis graph in terms of the smallest latency and energy cost of
the parameters in $\Gamma$, and therefore it is a constant that does not depend on $|\Gamma|$ or $N_\delta$. On the other
hand, when the cost is two-dimensional, the number of DE iterations could grow exponentially in
terms of the number of decoding iterations. However, even in the case of a 2-D cost, an ordering
exists between the costs of paths $P$ and $P'$ if
$(E_P > E_{P'} \wedge T_P > T_{P'}) \vee (E_P < E_{P'} \wedge T_P < T_{P'})$,
and in that case Rule 2 can be applied. In practice, for the cases presented in this paper, the
discarding rules kept the number of paths at a manageable level, even when using a 2-D cost.
Note that an alternative to the use of a 2-D cost is to define a 1-D cost as
$C_P = E_P + k T_P$, and to perform a binary search for the value of $k$ that yields an optimal
solution with the desired latency.
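The dominance test of Rule 2 with a 2-D cost can be sketched as below. The inequalities are taken as non-strict here so that exact ties can be pruned, which is an assumption of this sketch; the dictionary layout, the tie-breaking by list position, and the example numbers are also hypothetical.

```python
def costs_are_ordered(c1, c2):
    """2-D costs (energy, latency) are comparable if one is no larger in both."""
    return ((c1[0] >= c2[0] and c1[1] >= c2[1]) or
            (c1[0] <= c2[0] and c1[1] <= c2[1]))


def dominates(p, q):
    """True if path p dominates path q (non-strict version of Rule 2)."""
    c1, c2 = p["cost"], q["cost"]
    return (costs_are_ordered(c1, c2)
            and c1[0] <= c2[0] and c1[1] <= c2[1]
            and p["p_e"] <= q["p_e"])


def prune(paths):
    """Discard every path that is dominated by another path."""
    kept = []
    for i, candidate in enumerate(paths):
        discarded = False
        for j, other in enumerate(paths):
            if j == i or not dominates(other, candidate):
                continue
            # On mutual dominance (exact tie), keep only the earlier path.
            if not dominates(candidate, other) or j < i:
                discarded = True
                break
        if not discarded:
            kept.append(candidate)
    return kept


paths = [
    {"name": "A", "cost": (10.0, 5.0), "p_e": 0.01},
    {"name": "B", "cost": (12.0, 6.0), "p_e": 0.02},  # dominated by A
    {"name": "C", "cost": (8.0, 7.0), "p_e": 0.01},   # not comparable with A
    {"name": "D", "cost": (10.0, 5.0), "p_e": 0.01},  # exact tie with A
]
print([p["name"] for p in prune(paths)])
```

Paths A and C both survive because neither cost is smaller in both dimensions, which is the situation that can make the number of retained paths grow with a 2-D cost.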

The algorithm can also be modified to search for parameter sequences that have other desirable
properties beyond minimal energy or EDP. For example, if the decoder is implemented as a
pipelined sequence of decoders, it can be desirable to favor solutions that do not require the
decoder to switch its parameters too often. We can find good approximate solutions by adding
a penalty to $E_P$ when the algorithm used in the current and next steps is different.

D. Results 

We use DE-Gear-Shift to find good parameter sequences $\boldsymbol{\gamma}$ for several regular ensembles with
rates 1/2 and 9/10. The parameter space $\Gamma$ consists of ($V_{dd}$, $T_{clk}$) points with $V_{dd}$ from 0.70 V to
1.0 V in steps of 0.05 V and several $T_{clk}$ values depending on $V_{dd}$, in steps of 0.1 ns. The standard
and quasi-synchronous decoders use the same circuits. Parameter a in (1) is set to a = 4 for the
(3,6), (3,30), and (4,40) decoders, and to a = 2 for the (4,8) decoder. The offset parameter C
in Alg. 1 is set to C = 2 for the (4,40) decoder and to C = 1 for all other decoders. As part of
our best effort to design a good standard circuit, in the case of the (3,30) decoder we present
results for two circuits synthesized with different nominal $T_{clk}$ values. The standard circuit has
a lower energy consumption when synthesized with $T_{clk}$ = 3 ns, while it has a lower EDP when
synthesized with $T_{clk}$ = 2.5 ns.

We first run the DE-Gear-Shift solver without any path penalties to obtain the best possible
parameter sequences, for both the energy and the EDP objectives. We also noticed that in some
cases, adding a small algorithm change penalty allows slightly better sequences to be discovered.
Note that when the objective is EDP, there is no constraint on latency. These results are
summarized in Table I, where the energy is normalized per check node. Overall, we see that
significant gains are possible while achieving the same channel noise, latency, and residual error
requirements. The synthesis results show that increasing $d_v$ while keeping the rate constant leads
to a significant increase in circuit area. Despite this, increasing the node degrees can result in a
reduction of the EDP. For the rate-9/10 ensembles, going from $d_v = 3$ to $d_v = 4$ decreases EDP
by 6.4% for a standard system, and by 14% for a quasi-synchronous system. However this is not
the case for the rate-1/2 ensembles, where $d_v = 3$ has the smaller EDP. As expected, we can also
see that much more energy is required when the channel quality is close to the ensemble’s threshold.
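The figures quoted in this paragraph can be cross-checked against the entries of Table I. The short sketch below recomputes the standard EDP as energy times latency and the relative EDP reductions between the (3,30) and (4,40) decoders at a channel error rate of 0.015; the helper names are hypothetical, and the numbers are copied from the table.

```python
def edp_nj_ns(energy_pj, latency_ns):
    """Energy-delay product in nJ*ns from energy in pJ and latency in ns."""
    return energy_pj * latency_ns / 1000.0


def relative_reduction(reference, new):
    """Fractional reduction of `new` with respect to `reference`."""
    return (reference - new) / reference


# Standard circuits: EDP column equals energy times latency.
print(edp_nj_ns(68.2, 22))    # (3,6) at 0.09, listed as 1.50
print(edp_nj_ns(364, 27.0))   # (4,40) at 0.015, listed as 9.83

# Rate-9/10 ensembles at 0.015, using the best EDP of each family.
standard_gain = relative_reduction(10.5, 9.83)   # (3,30) at 2.5 ns vs (4,40)
quasi_sync_gain = relative_reduction(6.92, 5.93)
print(round(100 * standard_gain, 1), round(100 * quasi_sync_gain, 1))
```

For the rate-1/2 ensembles the same comparison goes the other way, since the (3,6) decoder at 0.09 has an EDP of 1.50 against 1.77 for the (4,8) decoder.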

By applying a cost penalty to parameter switches, it is possible to find parameter sequences
with few switches, without a large increase in cost. For example, for a (3,6) decoder starting
at $p_e^{(0)} = 0.09$, a single operating condition can provide a 32% EDP improvement, using the
parameter vector [0.8 V, 2.1 ns] for all iterations. The probability of a non-zero deviation in that
schedule ranges from 0.6% to 7.2%. In the case of a (3,30) decoder synthesized at a nominal
$T_{clk}$ = 2.5 ns, for $p_e^{(0)} = 0.015$ the sequence that applies [0.8 V, 2.5 ns] repeatedly and then
switches to [1.0 V, 2.5 ns] provides a 30% EDP improvement, with non-zero deviation
probabilities from 0 to 0.8%. For a (4,40) decoder, the single-parameter sequence that uses
[0.8 V, 2.8 ns] provides a 39% EDP improvement, with non-zero deviation probabilities
from 1.6 to 4.6%.


VI. Conclusion 

We presented a method for the design of synchronous circuit implementations of signal 
processing algorithms that permits timing violations without the need for hardware compensation. 
We introduced a model for the deviations occurring in LDPC decoder circuits affected by timing
faults that represents the circuit behavior accurately [27], while being independent of the iterative
progress of the decoder. In addition, we showed that in order to use density evolution to predict 
the performance of the faulty decoder, it is sufficient for the deviation model to have a weak 
symmetry property, which is more general than previously proposed sufficient properties. 

We then presented an approximate optimization method called DE-Gear-Shift to find sequences 
of circuit operating parameters that minimize the energy or the energy-delay product. The method 
is similar to the previously proposed Gear-Shift method, but relies on density evolution rather than 
EXIT charts to evaluate the average iterative progress of the decoder. Our results show that the
best energy or EDP reduction is achieved by operating the circuit with a large number of timing 
violations (often with an average probability of non-zero deviation above 1%). Furthermore, 
important savings can be achieved with few parameter switches, and without any compromise 
on circuit area or decoding performance. 

In this work, we only considered delay variations associated with the signal transitions at 
the input of the circuit. While the energy savings that result from tolerating these variations are 
already significant, we ultimately see quasi-synchronous systems as an approach for tolerating the 
large process variations found in near-threshold CMOS circuits and other emerging computing 
technologies, potentially enabling energy savings of an order of magnitude. Furthermore, we 
believe this approach can be extended to other self-correcting algorithms, such as deep neural 
networks.

Appendix A
CAD Workflow

The deviations and the energy consumption are measured directly on optimized circuit models
generated by a commercial synthesis tool (Cadence Encounter [34]). We use TSMC’s 65 nm
process with the tcbn65gplus cell library [35]. In order to provide a fair assessment of the
improvements provided by the quasi-synchronous circuit, we first synthesize a benchmark circuit
that represents a best effort at optimizing the metric of interest, for example energy consumption.
Since we do not have a specific throughput constraint for the design, we synthesize the benchmark
circuit at the standard supply voltage of the library ($V_{dd}$ = 1.0 V), while the clock period is chosen
as small as possible without causing a degradation of the target metric. Second, we synthesize
a nominal circuit that will serve as the basis for the quasi-synchronous design. In this work, we
use a standard synthesis algorithm for the nominal circuit, and in all the cases that we report on, 
the nominal and the benchmark circuits are actually the same. Using a standard synthesis method 
for the nominal circuit allows using off-the-shelf tools, but is not ideal since the objective of a 
standard synthesis algorithm (to make all paths only as fast as the clock period) differs from 
the objective pursued when some timing violations are permitted. For example, results in [36] 
show that the power consumption of a circuit can be reduced by up to 32% when the gate-sizing 
optimization takes into account the acceptable rate of timing violations. Therefore it is possible 
that our results could be improved by using a different synthesis algorithm. 

Once the circuit is synthesized, we perform a static timing analysis of the gate-level model 
at various supply voltages. All timing analyses (including at the nominal supply) are performed 
using timing libraries generated by the Cadence Encounter Library Characterization tool. 
We then use this timing information in a functional simulation of the gate-level circuit to observe 
the dynamic effect of path delay variations and measure the deviation statistics. Any source of 
delay variation that can be simulated can be studied, but in this paper we focus on variations due 
to path activation, that is the variations in delay caused by the different propagation times required 
by different input transitions. Note that other methods could be used to obtain the propagation 
delays, such as the method described in [37] based on analytical models. In addition to speeding 
up the characterization, such methods allow considering the effect of process variations. 

Power estimation is performed by collecting switching activity data in the functional simulation
and using the power estimation engine in Cadence Encounter. However, because the circuit is
operated in a quasi-synchronous manner, the clock period used to run the circuit is not necessarily
the same as the nominal clock period. When that is the case, the power estimation generated
by the synthesis tool cannot be used directly. First, the switching activity recorded during the
functional simulation must be scaled so that it corresponds to the nominal clock period. The
tool’s power estimation then reports the dynamic power $P_{dyn}$ and the static power $P_{stat}$. The
dynamic energy consumed during one clock cycle does not depend on the clock period, whereas
the static energy does. Therefore, the total energy consumed during one cycle by the
quasi-synchronous circuit is given by $E_{cycle} = P_{dyn} T_{clk,nom} + P_{stat} T_{clk}$, where $T_{clk,nom}$ is the
nominal clock period and $T_{clk}$ is the actual clock period used to run the circuit.
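The per-cycle energy expression is simple enough to state as a one-line function. In the sketch below the function name and the numerical values are hypothetical; with power in mW and time in ns, the result is in pJ.

```python
def energy_per_cycle(p_dyn, p_stat, t_clk_nom, t_clk):
    """Energy of one clock cycle of the quasi-synchronous circuit.

    The dynamic power is reported at the nominal clock period, so the dynamic
    energy uses t_clk_nom; the static energy scales with the actual period.
    """
    return p_dyn * t_clk_nom + p_stat * t_clk


# Hypothetical values: 10 mW dynamic, 0.5 mW static, nominal period 2.0 ns,
# circuit actually clocked at 2.5 ns.
print(energy_per_cycle(p_dyn=10.0, p_stat=0.5, t_clk_nom=2.0, t_clk=2.5))
```

When the actual and nominal clock periods coincide, the expression reduces to the total reported power multiplied by the clock period.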

Appendix B 

Test Circuit Monte-Carlo Simulation 

A suitable test circuit for a row-layered decoder architecture consists of a single
check node processor, as well as the necessary logic taken from the variable node processor block
to send messages to the CNP, and receive one message from the CNP. This test circuit is
shown in Fig. 8. It re-uses logic blocks that are found in the complete decoder, ensuring the
accuracy of the deviation and energy measurements, and minimizing design time.

The test circuit is used to evaluate the decoder’s computation tree (shown in Fig. 3). The 
VNP with index 1, shown at the top, is always mapped to the VN that is at the head of the 
computation tree, while the VNPs at the bottom of the figure are mapped to different VNs as 
the CNP is successively mapped to each CN neighbor of the head VN. 

At any given clock cycle, a VNP front block is mapped to a particular VN $i$. For illustrative 
purposes, we simply index the VN neighbors from 1 to $d_c$, even if the VNs mapped to the bottom 
VNPs actually change at each layer. Each VNP front block takes as input the previous belief 
total of that VN, $\Lambda_i$, and the previous CN-to-VN message corresponding to layer $\ell$, 
$\lambda_{i,j(\ell)}$. 

To perform the Monte-Carlo simulation, a VNP front circuit block with index $i$ must send a 
message randomly generated according to a 1-D normal distribution with error probability 
$p_e^{(t)}$. However, the only inputs that are controllable are $\Lambda_i$ and 
$\lambda_{i,j(\ell)}$. To simplify the Monte-Carlo simulation, we disregard the true distribution of 
$\lambda_{i,j(\ell)}$ and generate it according to a 1-D normal distribution. We also introduce 
another simplification: we assume that messages received at a VN only modify the total belief at the 
end of the iteration, as would be the case when using a flooding schedule. As a result, the messages 
$\mu_{i,j(\ell)}$ are identically distributed with error rate parameter $p_e^{(t)}$ for all $i$. 
Note that these simplifications are not necessary, and they could be removed at the cost of a 
slightly more cumbersome Monte-Carlo simulation. 

Fig. 8. Block diagram of the test circuit. 

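To generate inputs with the appropriate distribution, we use the fact that 
$\Lambda_i = \mu_{i,j(\ell)} + \lambda_{i,j(\ell)}$. On a cycle-free Tanner graph, 
$\mu_{i,j(\ell)}$ and $\lambda_{i,j(\ell)}$ are independent, but naturally $\Lambda_i$ and 
$\lambda_{i,j(\ell)}$ are not. Therefore, we generate $\mu_{i,j(\ell)}$ and $\lambda_{i,j(\ell)}$ 
and sum them to obtain $\Lambda_i$. 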
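
The input generation described above can be sketched as follows. The sketch rests on assumptions that are not stated in this appendix: a "1-D normal distribution" is taken to be a Gaussian with mean $m$ and variance $2m$, a message is counted as erroneous when it is negative (all-zero codeword convention), and messages are real-valued rather than quantized as they would be in the circuit. All function names are hypothetical.

```python
import random
from statistics import NormalDist


def mean_from_error_prob(p_e):
    """Mean m of N(m, 2m) such that P(X < 0) = p_e (assumed 1-D normal model)."""
    if not 0.0 < p_e < 0.5:
        raise ValueError("p_e must lie strictly between 0 and 0.5")
    q_inv = NormalDist().inv_cdf(1.0 - p_e)  # Q^{-1}(p_e)
    return 2.0 * q_inv ** 2                  # from p_e = Q(sqrt(m / 2))


def sample_message(p_e, rng):
    """Draw one real-valued message with error probability p_e."""
    m = mean_from_error_prob(p_e)
    return rng.gauss(m, (2.0 * m) ** 0.5)


def sample_vnp_inputs(p_e_vn_to_cn, p_e_cn_to_vn, rng):
    """Return (belief total, previous CN-to-VN message) for one VNP front block.

    The VN-to-CN message and the previous CN-to-VN message are drawn
    independently, and the belief total is their sum, so that the block's
    subtraction recovers the intended VN-to-CN message.
    """
    vn_to_cn = sample_message(p_e_vn_to_cn, rng)
    cn_to_vn = sample_message(p_e_cn_to_vn, rng)
    return vn_to_cn + cn_to_vn, cn_to_vn
```

Drawing the two messages independently and summing them, rather than drawing the belief total directly, preserves the dependence between the belief total and the previous CN-to-VN message.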

To complete the DE iteration, we want to measure an extrinsic message belonging to the 
next iteration. Because we assume a flooding schedule, this extrinsic message can be obtained 
by summing any set of $(d_v - 1)$ messages in the current iteration. To achieve this, we start a 
DE iteration by setting $\Lambda_i = 0$ and $\lambda_{i,j(\ell)} = 0$ for all $i$. The desired 
extrinsic message then corresponds to the total belief output of the circuit $\Lambda_i$ after 
$d_v - 1$ layers have been evaluated. 
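
The accumulation of the extrinsic message over successive layers can be illustrated with an idealized software model. This is a sketch under explicit assumptions and not the test circuit itself: it uses a fault-free, real-valued offset min-sum check node, it ignores timing deviations and quantization, and the function names and the offset value of 0.5 are hypothetical.

```python
def offset_min_sum(inputs, offset):
    """Ideal offset min-sum CN output toward the head VN.

    `inputs` holds the messages sent to the CN by the other VNs.
    """
    sign = 1.0
    for x in inputs:
        if x < 0:
            sign = -sign
    magnitude = max(min(abs(x) for x in inputs) - offset, 0.0)
    return sign * magnitude


def head_extrinsic(layer_inputs, offset=0.5):
    """Sum the CN outputs received by the head VN over successive layers.

    `layer_inputs` has one entry per evaluated layer (d_v - 1 entries in the
    procedure described above). The belief total starts at zero, so the
    result is the sum of the CN-to-VN messages from those layers.
    """
    total = 0.0
    for inputs in layer_inputs:
        total += offset_min_sum(inputs, offset)
    return total
```

In the Monte-Carlo simulation, the check node outputs come from the gate-level circuit operated with timing violations, so this ideal model only serves as a fault-free reference for the same accumulation.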

Just like the processor used in the complete decoder, the test circuit has one input and one 
output register, as well as one internal pipeline register, for a latency of 3 clock cycles. In order 
to keep the pipeline fed, several distinct computation trees are evaluated in parallel during the 
Monte-Carlo simulation. 